AITopics | asr model

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Fit for our purpose, not yours: Benchmark for a low-resource, Indigenous language

Neural Information Processing Systems | Feb-11-2026, 04:54:23 GMT

The datasets contain numerous grammatical and orthographic errors, poor pronunciation, limited vocabulary, and the content lacks cultural relevance to the language community.

benchmark, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

Asia > Indonesia > Bali (0.04)
South America > Peru (0.04)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
(3 more...)

Genre: Research Report > Experimental Study (0.93)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

Lin, Ye Bhone, Aung, Thura, Thu, Ye Kyaw, Oo, Thazin Myint

arXiv.org Artificial Intelligence | Nov-27-2025

This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. To our knowledge, this is the first study addressing ASR error correction specifically for Burmese. We evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word- and character-level accuracy over baseline outputs. The proposed AEC model, combining IPA and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and from 51.56 to 43.59 after augmentation) and improved chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.
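As a reading aid for the WER figures quoted above, the following is a minimal sketch of how word error rate is commonly computed (word-level edit distance divided by reference length) and of the relative reductions implied by the reported averages. It is not the authors' evaluation code. The function names are hypothetical, and whitespace tokenization is an assumption made for illustration; Burmese text would typically need a word segmenter first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent; assumes whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(before, after):
    """Relative improvement in percent when a score drops from before to after."""
    return 100.0 * (before - after) / before


# Averages reported in the abstract: 51.56 -> 39.82 and 51.56 -> 43.59.
print(round(relative_reduction(51.56, 39.82), 1))  # 22.8
print(round(relative_reduction(51.56, 43.59), 1))  # 15.5
```

Under these assumptions, the reported absolute drops correspond to relative WER reductions of roughly 22.8% and 15.5%.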

data quality, error correction, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2511.21088

Country:

Europe (0.46)
Asia > Myanmar (0.16)
Asia > Thailand (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Data Science > Data Quality > Data Cleaning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

CodeVaani: A Multilingual, Voice-Based Code Learning Assistant

Havare, Jayant, Tamilselvam, Srikanth, Mittal, Ashish, Thorat, Shalaka, Jadia, Soham, Apte, Varsha, Ramakrishnan, Ganesh

arXiv.org Artificial Intelligence | Nov-27-2025

Programming education often assumes English proficiency and text-based interaction, creating barriers for students from multilingual regions such as India. We present CodeVaani, a multilingual speech-driven assistant for understanding code, built into Bodhitree [1], a Learning Management System developed at IIT Bombay. It is a voice-enabled assistant that helps learners explore programming concepts in their native languages. The system integrates Indic ASR, a code-aware transcription refinement module, and a code model for generating relevant answers. Responses are provided in both text and audio for natural interaction. In a study with 28 beginner programmers, CodeVaani achieved 75% response accuracy, with over 80% of participants rating the experience positively. Compared to classroom assistance, our framework offers on-demand availability, scalability to support many learners, and multilingual support that lowers the entry barrier for students with limited English proficiency. The demo will illustrate these capabilities and highlight how voice-based AI systems can make programming education more inclusive. Supplementary artifacts and demo video are also made available.

artificial intelligence, natural language, query, (18 more...)

arXiv.org Artificial Intelligence

2511.20654

Country: Asia > India (0.35)

Genre:

Research Report (0.50)
Questionnaire & Opinion Survey (0.49)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.71)

Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward

Wang, Guansu, Sun, Peijie

arXiv.org Artificial Intelligence | Nov-25-2025

Recent advancements in Text-to-Speech (TTS) technology have been remarkable, enabling current models to clone arbitrary unseen speakers and synthesize high-quality, natural-sounding speech. However, corresponding evaluation techniques appear to be lagging: existing Mean Opinion Score (MOS) estimation models typically perform regression-based scoring on entire speech segments, while a failed synthesized utterance usually contains problematic elements in only a few isolated words rather than throughout the entire utterance. In this context, we present an intriguing finding: encoder-decoder ASR models, such as Whisper, leverage their extensive pre-training to precisely capture word-level mismatches between speech and text within their cross-attention mechanisms, thereby providing a fine-grained reward signal. Building upon this insight, we propose a novel TTS optimization method, which we term Word-level TTS Alignment by ASR-driven Attentive Reward (W3AR). Instead of relying on any explicit reward annotations, W3AR leverages the attention information within a pre-trained ASR model, enabling finer-grained alignment and optimization of the sequences predicted by the TTS model. Experimental results demonstrate that W3AR not only effectively improves the TTS generation quality of existing models but also further enhances zero-shot robustness based on both in-domain and out-of-domain prompt speakers. Additionally, our findings and proposed methodology offer a new insight for generative tasks: understanding models can potentially serve as evaluators, providing highly fine-grained and valuable feedback for generation.

artificial intelligence, arxiv preprint arxiv, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2511.17555

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems

Yonatan Belinkov, James Glass

Neural Information Processing Systems | Nov-21-2025, 12:23:38 GMT

Neural networks have become ubiquitous in automatic speech recognition systems.

artificial intelligence, machine learning, representation, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(5 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

Synthetic Voice Data for Automatic Speech Recognition in African Languages

DeRenzi, Brian, Dixon, Anna, Farhi, Mohamed Aymane, Resch, Christian

arXiv.org Artificial Intelligence | Nov-10-2025

Speech technology remains out of reach for most of the over 2300 languages in Africa. We present the first systematic assessment of large-scale synthetic voice corpora for African ASR. We apply a three-step process: LLM-driven text creation, TTS voice synthesis, and ASR fine-tuning. Eight out of ten languages for which we create synthetic text achieved readability scores above 5 out of 7. We evaluated ASR improvement for three (Hausa, Dholuo, Chichewa) and created more than 2,500 hours of synthetic voice data at below 1% of the cost of real data. Fine-tuned Wav2Vec-BERT-2.0 models trained on 250h real and 250h synthetic Hausa matched a 500h real-data-only baseline, while 579h real and 450h to 993h synthetic data created the best performance. We also present gender-disaggregated ASR performance evaluation. For very low-resource languages, gains varied: Chichewa WER improved about 6.5% relative with a 1:2 real-to-synthetic ratio; a 1:1 ratio for Dholuo showed similar improvements on some evaluation data, but not on others. Investigating intercoder reliability, ASR errors and evaluation datasets revealed the need for more robust reviewer protocols and more accurate evaluation data. All data and models are publicly released to invite further work to improve synthetic data for African languages.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.26615/978-954-452-100-4-016

2507.17578

Country:

North America > United States (1.00)
Africa (1.00)
Asia (0.92)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction

Xu, Qianheng

arXiv.org Artificial Intelligence | Nov-6-2025

Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting its transcription. StutterZero employs a convolutional-bidirectional LSTM encoder-decoder with attention, whereas StutterFormer integrates a dual-stream Transformer with shared acoustic-linguistic representations. Both architectures are trained on paired stuttered-fluent data synthesized from the SEP-28K and LibriStutter corpora and evaluated on unseen speakers from the FluencyBank dataset. Across all benchmarks, StutterZero had a 24% decrease in Word Error Rate (WER) and a 31% improvement in semantic similarity (BERTScore) compared to the leading Whisper-Medium model. StutterFormer achieved better results, with a 28% decrease in WER and a 34% improvement in BERTScore. The results validate the feasibility of direct end-to-end stutter-to-fluent speech conversion, offering new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.18938

Country: Asia (0.28)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.93)

Industry:

Education (1.00)
Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Gu, Zijin, Likhomanenko, Tatiana, Jaitly, Navdeep

arXiv.org Artificial Intelligence | Nov-6-2025

Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
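The contrast between per-layer routing and a shared router can be sketched as follows. This is an illustration only, not the Omni-router implementation: the abstract states just that one router is shared across MoE layers, so the linear top-1 router, the sizes, and all names here are assumptions. The same hidden vector is fed to every layer purely to make the effect of parameter sharing visible; in a real model the hidden state changes from layer to layer.

```python
import random


def make_router(num_experts, dim, rng):
    """Hypothetical linear router: one weight row per expert."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(num_experts)]


def route_top1(router, hidden):
    """Return the index of the expert with the highest router logit."""
    logits = [sum(w * x for w, x in zip(row, hidden)) for row in router]
    return max(range(len(logits)), key=lambda e: logits[e])


rng = random.Random(0)
NUM_LAYERS, NUM_EXPERTS, DIM = 4, 8, 16  # illustrative sizes only

# Independent routing: every MoE layer owns its own router parameters.
independent_routers = [make_router(NUM_EXPERTS, DIM, rng)
                       for _ in range(NUM_LAYERS)]

# Shared routing: every MoE layer reuses the same router parameters.
shared = make_router(NUM_EXPERTS, DIM, rng)
shared_routers = [shared] * NUM_LAYERS

hidden = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
independent_choices = [route_top1(r, hidden) for r in independent_routers]
shared_choices = [route_top1(r, hidden) for r in shared_routers]

print("independent:", independent_choices)
print("shared:     ", shared_choices)
```

With shared parameters, identical inputs map to the same expert index at every layer, which is one simple way to see how sharing ties routing decisions together across depth.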

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.05724

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

A Neural Model for Contextual Biasing Score Learning and Filtering

Huang, Wanting, Wang, Weiran

arXiv.org Artificial Intelligence | Oct-29-2025

Contextual biasing improves automatic speech recognition (ASR) by integrating external knowledge, such as user-specific phrases or entities, during decoding. In this work, we use an attention-based biasing decoder to produce scores for candidate phrases based on acoustic information extracted by an ASR encoder, which can be used to filter out unlikely phrases and to calculate a bonus for shallow-fusion biasing. We introduce a per-token discriminative objective that encourages higher scores for ground-truth phrases while suppressing distractors. Experiments on the Librispeech biasing benchmark show that our method effectively filters out the majority of the candidate phrases, and significantly improves recognition accuracy under different biasing conditions when the scores are used in shallow-fusion biasing. Our approach is modular and can be used with any ASR system, and the filtering mechanism can potentially boost the performance of other biasing methods.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.23849

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.73)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
